Species in the same lake are supposed to hybridize occasionally, so that they may share variation along large portions of the genome. That is: the \(F_{ST}\) between species in the same lake are expected to be low, in general. But loci that contribute to the adaptation of an ecomorph must remain divergent between ecomorphs in the same lake. In a background of low \(F_{ST}\) values, it should be easy to spot the positions with high \(F_{ST}\) values.

Alpine region

We will compare first C. fatioi (Felchen ecomorph, Thun and Brienz lakes) with C. steinmanni (Balchen ecomorph, in Thun). We use vcftools (Danecek et al. 2011) to estimate weighted \(F_{ST}\)s (Weir and Cockerham 1984) in overlapping genomic windows of one million base pairs.

VCF=../2022-04-22/assem2_outfiles/assem2.vcf
FISHDATA=../../data/Fish_clean2.tsv

if [ ! -d FST ]; then mkdir FST; fi

if [ ! -e FST/Thun-Brienz.windowed.weir.fst ]; then
   if [ ! -e dirty.recode.vcf ]; then
      vcftools --vcf $VCF \
               --out dirty \
               --remove ../2022-05-04/TheWorst.txt \
               --recode \
               --recode-INFO-all > vcftools.log 2>&1 
   fi

   if [ ! -e C.fatioi.txt ]; then
      gawk '($19 == "C.fatioi"){print $1}' $FISHDATA > C.fatioi.txt
   fi

   if [ ! -e C.steinmanni ]; then
      gawk '($19 == "C.steinmanni"){print $1}' $FISHDATA > C.steinmanni.txt
   fi

   vcftools --vcf dirty.recode.vcf \
            --out FST/Thun-Brienz \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop C.fatioi.txt \
            --weir-fst-pop C.steinmanni.txt \
            --fst-window-size 1000000 \
            --fst-window-step  250000 >> vcftools.log 2>&1
fi
suppressMessages(library(Biostrings))
suppressMessages(library(GenomicRanges))
suppressMessages(library(ggplot2))
ref             <- readDNAStringSet('../../data/reference.fa')
# The names of chromosomes must be only the first word in the fasta name line.
names(ref)      <- sapply(strsplit(names(ref), " "), '[[', 1)
# the refWidth table contains the lengths of chromosomes.
refWidth        <- width(ref)
names(refWidth) <- names(ref)

fst1 <- read.table('FST/Thun-Brienz.windowed.weir.fst',
                  header = TRUE, stringsAsFactors = FALSE)
fst1$POS        <- (fst1$BIN_END + fst1$BIN_START) / 2
offset          <- cumsum(refWidth)[1:39]
names(offset)   <- names(ref)[2:40]
offset['LR778253.1'] <- 0
fst1$offset      <- offset[fst1$CHROM]
fst1$POS2        <- fst1$POS + fst1$offset
ChromCol        <- rep(c(1,2), 20)
names(ChromCol) <- names(ref)
fst1$ChromCol    <- ChromCol[fst1$CHROM]

# Thun-Brienz:
thrs1 <- quantile(fst1$WEIGHTED_FST, probs = 0.96)
f1 <- fst1$WEIGHTED_FST >= thrs1

ggplot(data = fst1, mapping = aes(x = POS2, y = WEIGHTED_FST, color = ChromCol)) +
   geom_point() + theme(legend.position="none") +
   xlab('Position') + ylab(expression(F[ST])) +
   ggtitle('Felchen vs. Balchen ecomoprhs (Thun-Brienz lakes)') +
   geom_hline(yintercept = round(thrs1, 2), color = "red") +
   theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Above, \(F_{ST}\) values are plotted against their positions in the genome, with consecutive chromosomes in alternative colors. Every data point is a weighted \(F_{ST}\) value among a few SNPs in a 1000000-bp genomic window. Large overlaps between adjacent windows would make a smoother line in individual chromosomes, but give the false impression of a lot of data.

Some \(F_{ST}\) values are negative, but close to zero. Negative \(F_{ST}\) values must mean that individuals from the same species are less similar among them than to members of the other species, at least in these samples. I assume the low magnitude of negative values is compatible with random sampling variation.

The equivalent comparison between a Felchen and a Balchen ecomorph in a different lake is the comparsion of C. zuerichensis (Felchen) and C. duplex (Balchen), both from the Zurich and Walen lakes. If natural selection targeted the same loci in both lakes to produce the adaptations that define the Felchen and Balchen ecomorphs, then, several windows with a high \(F_{ST}\) above should also have a high \(F_{ST}\) in the results below. Let’s see.

FISHDATA=../../data/Fish_clean2.tsv

if [ ! -e FST/Walen-Zurich.windowed.weir.fst ]; then
   if [ ! -e C.zuerichensis.txt ]; then
      gawk '($19 == "C.zuerichensis"){print $1}' $FISHDATA > C.zuerichensis.txt
   fi
   
   if [ ! -e C.duplex.txt ]; then
      gawk '($19 == "C.duplex"){print $1}' $FISHDATA > C.duplex.txt
   fi
   vcftools --vcf dirty.recode.vcf \
            --out FST/Walen-Zurich \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop C.zuerichensis.txt \
            --weir-fst-pop C.duplex.txt \
            --fst-window-size 1000000 \
            --fst-window-step  250000 >> vcftools.log 2>&1
fi
fst2 <- read.table('FST/Walen-Zurich.windowed.weir.fst',
                   header = TRUE, stringsAsFactors = FALSE)
fst2$POS        <- (fst2$BIN_END + fst2$BIN_START) / 2
offset          <- cumsum(refWidth)[1:39]
names(offset)   <- names(ref)[2:40]
offset['LR778253.1'] <- 0
fst2$offset     <- offset[fst2$CHROM]
fst2$POS2       <- fst2$POS + fst2$offset
ChromCol        <- rep(c(1,2), 20)
names(ChromCol) <- names(ref)
fst2$ChromCol   <- ChromCol[fst2$CHROM]

# Walen-Zurich:
thrs2 <- quantile(fst2$WEIGHTED_FST, probs = 0.96)
f2 <- fst2$WEIGHTED_FST >= thrs2

ggplot(data = fst2, mapping = aes(x = POS2, y = WEIGHTED_FST, color = ChromCol)) +
   geom_point() + theme(legend.position="none") +
   xlab('Position') + ylab(expression(F[ST])) +
   ggtitle('Felchen vs. Balchen ecomorphs (Walen-Zurich lakes)') +
   geom_hline(yintercept = round(thrs2, 2), color = "red") +
   theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

In lakes Zurich and Walen, which are connected, \(F_{ST}\) values between the Balchen and Felchen ecomorphs are higher in general, indicating that the species C. zuerichensis and C. duplex are better differentiated genetically than C. fatioi and C. steinmanni in Thun-Brienz.

We assume that the highest values of \(F_{ST}\) between two species in the same lake must be associated with loci adapted to each species’ environment. However, we know that \(F_{ST}\) is a relative measure of divergence between populations, because it is influenced by the levels of diversity within populations (Cruickshank and Hahn 2014). That is: the lower the genetic diversity within populations, the higher \(F_{ST}\) is expected to be. We will proceed to identify and compare windows of high differentiation, and then we will check if they could be an artifact of variable levels of within-population diversity.

The first way to compare highly differentiated genomic regions between lakes consists in: defining a threshold and visualizing the positions of windows above that threshold in both lakes. The threshold is the 96\(^\text{th}\) percentile.

# Thun-Brienz:
thrs1 <- quantile(fst1$WEIGHTED_FST, probs = 0.96)
f1 <- fst1$WEIGHTED_FST >= thrs1
# Walen-Zurich:
thrs2 <- quantile(fst2$WEIGHTED_FST, probs = 0.96)
f2 <- fst2$WEIGHTED_FST >= thrs2

GR1 <- GRanges(
   seqnames = factor(fst1[f1, 'CHROM']),
   ranges = IRanges(start = fst1[f1, 'BIN_START'],
                    end = fst1[f1, 'BIN_END'],
                    names = paste0(fst1[f1, 'CHROM'], ':',
                                   fst1[f1, 'BIN_START'], '-',
                                   fst1[f1, 'BIN_END'],
                                   sep = '')),
   strand = '+',
   seqinfo = Seqinfo(seqnames = names(ref),
                     seqlengths = width(ref))
)
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 14 out-of-bound ranges located on sequences
##   LR778261.1, LR778266.1, LR778268.1, LR778269.1, LR778270.1, LR778276.1,
##   LR778285.1, and LR778290.1. Note that ranges located on a sequence
##   whose length is unknown (NA) or on a circular sequence are not
##   considered out-of-bound (use seqlengths() and isCircular() to get the
##   lengths and circularity flags of the underlying sequences). You can use
##   trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more
##   information.
GR1 <- trim(GR1)
GR1 <- reduce(GR1)

GR2 <- GRanges(
   seqnames = factor(fst2[f2, 'CHROM']),
   ranges = IRanges(start = fst2[f2, 'BIN_START'],
                    end = fst2[f2, 'BIN_END'],
                    names = paste0(fst2[f2, 'CHROM'], ':',
                                   fst2[f2, 'BIN_START'], '-',
                                   fst2[f2, 'BIN_END'],
                                   sep = '')),
   strand = '+',
   seqinfo = Seqinfo(seqnames = names(ref),
                     seqlengths = width(ref))
)
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 15 out-of-bound ranges located on sequences
##   LR778254.1, LR778255.1, LR778258.1, LR778262.1, LR778268.1, LR778275.1,
##   LR778283.1, LR778284.1, and LR778292.1. Note that ranges located on a
##   sequence whose length is unknown (NA) or on a circular sequence are not
##   considered out-of-bound (use seqlengths() and isCircular() to get the
##   lengths and circularity flags of the underlying sequences). You can use
##   trim() to trim these ranges. See ?`trim,GenomicRanges-method` for more
##   information.
GR2 <- trim(GR2)
GR2 <- reduce(GR2)

overlaps <- findOverlaps(GR1, GR2, maxgap = 50000)
overlaps
## Hits object with 25 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         2           1
##    [2]         8           3
##    [3]        10           5
##    [4]        11           8
##    [5]        19          14
##    ...       ...         ...
##   [21]        83          86
##   [22]        93          91
##   [23]        94          92
##   [24]       102          96
##   [25]       103          98
##   -------
##   queryLength: 116 / subjectLength: 106
AlpOverlaps <- reduce(c(GR1[queryHits(overlaps)],
                        GR2[subjectHits(overlaps)]))

GR <- data.frame(
   chr = c(seqnames(GR1), seqnames(GR2)),
   start = c(start(GR1), start(GR2)),
   end = c(end(GR1), end(GR2)),
   lake = c(rep('Thun-Brienz', length(GR1)),
            rep('Walen-Zurich', length(GR2))),
   height = c(rep(2, length(GR1)),
              rep(1, length(GR2)))
)

for (chr in unique(GR$chr)) {
   f <- GR$chr == chr
   p <- ggplot(data = GR[f,],
               mapping = aes(x = start,
                             y = height,
                             xend = end,
                             yend = height,
                             color = lake)) +
      geom_segment(size = 5) +
      ylim(0,3) +
      xlim(0, width(ref[chr])) +
      theme(aspect.ratio = 0.3,
            axis.text.y = element_blank(),
            axis.line.y = element_blank(),
            axis.title.y = element_blank(),
            axis.ticks.y = element_blank()) +
      ggtitle(chr) + xlab('Position (bp)')
   print(p)
}

Only 25 genomic regions are highly differentiated between ecomorphs in both Alpine lake systems.

Arctic region

First, I will compare P and L ecomorphs from Suopatjavri. I will recover of the text files created in the previous folder (2022-12-23) for the estimation.

if [ ! -e FST/Suo.windowed.weir.fst ]; then
   P=../2022-12-23/SUO_P.txt
   L=../2022-12-23/SUO_L.txt
   
   vcftools --vcf dirty.recode.vcf \
            --out FST/Suo \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop $P \
            --weir-fst-pop $L \
            --fst-window-size 1000000 \
            --fst-window-step  250000 >> vcftools.log 2>&1
fi
fst3 <- read.table('FST/Suo.windowed.weir.fst',
                  header = TRUE, stringsAsFactors = FALSE)
fst3$POS    <- (fst3$BIN_END + fst3$BIN_START) / 2
offset     <- cumsum(refWidth)[1:39]
names(offset) <- names(ref)[2:40]
offset['LR778253.1'] <- 0
fst3$offset <- offset[fst3$CHROM]
fst3$POS2 <- fst3$POS + fst3$offset
ChromCol <- rep(c(1,2), 20)
names(ChromCol) <- names(ref)
fst3$ChromCol <- ChromCol[fst3$CHROM]

# SUO:
thrs3 <- quantile(fst3$WEIGHTED_FST, probs = 0.96)
f3 <- fst3$WEIGHTED_FST >= thrs3

ggplot(data = fst3, mapping = aes(x = POS2, y = WEIGHTED_FST, color = ChromCol)) +
   geom_point() + theme(legend.position="none") +
   xlab('Position') + ylab(expression(F[ST])) +
   ggtitle('Profundal vs. Littoral ecomorphs (Suopatjavri)') +
   geom_hline(yintercept = round(thrs3, 2), color = "red") +
   theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Second, I’m goingo to compare P and L from Langfjordvatn. The same as above.

if [ ! -e FST/Lan.windowed.weir.fst ]; then
   P=../2022-12-23/LAN_P.txt
   L=../2022-12-23/LAN_L.txt
   
   vcftools --vcf dirty.recode.vcf \
            --out FST/Lan \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop $P \
            --weir-fst-pop $L \
            --fst-window-size 1000000 \
            --fst-window-step  250000 >> vcftools.log 2>&1
fi
fst4 <- read.table('FST/Lan.windowed.weir.fst',
                  header = TRUE, stringsAsFactors = FALSE)
fst4$POS    <- (fst4$BIN_END + fst4$BIN_START) / 2
offset     <- cumsum(refWidth)[1:39]
names(offset) <- names(ref)[2:40]
offset['LR778253.1'] <- 0
fst4$offset <- offset[fst4$CHROM]
fst4$POS2 <- fst4$POS + fst4$offset
ChromCol <- rep(c(1,2), 20)
names(ChromCol) <- names(ref)
fst4$ChromCol <- ChromCol[fst4$CHROM]

# LAN:
thrs4 <- quantile(fst4$WEIGHTED_FST, probs = 0.96)
f4 <- fst4$WEIGHTED_FST >= thrs4

ggplot(data = fst4, mapping = aes(x = POS2, y = WEIGHTED_FST, color = ChromCol)) +
   geom_point() + theme(legend.position="none") +
   xlab('Position') + ylab(expression(F[ST])) +
   ggtitle('Profundal vs. Littoral ecomorphs (Langfjordvatn)') +
   geom_hline(yintercept = round(thrs4, 2), color = "red") +
   theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

# SUO:
thrs3 <- quantile(fst3$WEIGHTED_FST, probs = 0.96)
f3 <- fst3$WEIGHTED_FST >= thrs3
# LAN:
thrs4 <- quantile(fst4$WEIGHTED_FST, probs = 0.96)
f4 <- fst4$WEIGHTED_FST >= thrs4

GR3 <- GRanges(
   seqnames = factor(fst3[f3, 'CHROM']),
   ranges = IRanges(start = fst3[f3, 'BIN_START'],
                    end = fst3[f3, 'BIN_END'],
                    names = paste0(fst3[f3, 'CHROM'], ':',
                                   fst3[f3, 'BIN_START'], '-',
                                   fst3[f3, 'BIN_END'],
                                   sep = '')),
   strand = '+',
   seqinfo = Seqinfo(seqnames = names(ref),
                     seqlengths = width(ref))
)
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
##   LR778269.1. Note that ranges located on a sequence whose length is
##   unknown (NA) or on a circular sequence are not considered out-of-bound
##   (use seqlengths() and isCircular() to get the lengths and circularity
##   flags of the underlying sequences). You can use trim() to trim these
##   ranges. See ?`trim,GenomicRanges-method` for more information.
GR3 <- trim(GR3)
GR3 <- reduce(GR3)

GR4 <- GRanges(
   seqnames = factor(fst4[f4, 'CHROM']),
   ranges = IRanges(start = fst4[f4, 'BIN_START'],
                    end = fst4[f4, 'BIN_END'],
                    names = paste0(fst4[f4, 'CHROM'], ':',
                                   fst4[f4, 'BIN_START'], '-',
                                   fst4[f4, 'BIN_END'],
                                   sep = '')),
   strand = '+',
   seqinfo = Seqinfo(seqnames = names(ref),
                     seqlengths = width(ref))
)
## Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 18 out-of-bound ranges located on sequences
##   LR778254.1, LR778265.1, LR778266.1, LR778268.1, LR778270.1, LR778271.1,
##   and LR778273.1. Note that ranges located on a sequence whose length is
##   unknown (NA) or on a circular sequence are not considered out-of-bound
##   (use seqlengths() and isCircular() to get the lengths and circularity
##   flags of the underlying sequences). You can use trim() to trim these
##   ranges. See ?`trim,GenomicRanges-method` for more information.
# vcftools creates windows out of the end of the chromosome.
GR4 <- trim(GR4)
GR4 <- reduce(GR4)

overlaps <- findOverlaps(GR3, GR4, maxgap = 10000)
overlaps
## Hits object with 28 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1           1
##    [2]         2           4
##    [3]         2           5
##    [4]         4           7
##    [5]        12          16
##    ...       ...         ...
##   [24]        75          99
##   [25]        78         103
##   [26]        79         104
##   [27]        80         106
##   [28]        81         109
##   -------
##   queryLength: 94 / subjectLength: 114
ArcOverlaps <- reduce(c(GR3[queryHits(overlaps)],
                        GR4[subjectHits(overlaps)]))
ArcOverlaps
## GRanges object with 23 ranges and 0 metadata columns:
##          seqnames            ranges strand
##             <Rle>         <IRanges>  <Rle>
##    [1] LR778253.1 38000001-39750000      +
##    [2] LR778253.1 70000001-73750000      +
##    [3] LR778253.1 83500001-85500000      +
##    [4] LR778256.1 88250001-92000000      +
##    [5] LR778259.1  4250001-12250000      +
##    ...        ...               ...    ...
##   [19] LR778281.1 17250001-19500000      +
##   [20] LR778283.1 24000001-26250000      +
##   [21] LR778283.1 30750001-35250000      +
##   [22] LR778285.1 18500001-22000000      +
##   [23] LR778286.1  7000001-10000000      +
##   -------
##   seqinfo: 40 sequences from an unspecified genome
GRn <- data.frame(
   chr = c(seqnames(GR3), seqnames(GR4)),
   start = c(start(GR3), start(GR4)),
   end = c(end(GR3), end(GR4)),
   lake = c(rep('Suopatjavri', length(GR3)),
            rep('Langfjordvatn', length(GR4))),
   height = c(rep(1, length(GR3)),
              rep(2, length(GR4)))
)

for (chr in unique(GRn$chr)) {
   f <- GRn$chr == chr
   p <- ggplot(data = GRn[f,],
               mapping = aes(x = start,
                             y = lake,
                             xend = end,
                             yend = lake,
                             color = lake)) +
      geom_segment(size = 5) +
      #ylim(0,3) +
      xlim(0, width(ref[chr])) +
            theme(aspect.ratio = 0.3,
            axis.text.y = element_blank(),
            axis.line.y = element_blank(),
            axis.title.y = element_blank(),
            axis.ticks.y = element_blank()) +
      ggtitle(chr) + xlab('Position (bp)')
   print(p)
}

It turns out that a few genomic regions are highly differentiated between ecomorphs simultaneously in the two Arctic lakes. Now, the question is if any of the 28 regions presumably important in ecological adaptation in the Arctic are also important for the adaptation in the Alpine region.

crossRegionOverlaps <- findOverlaps(AlpOverlaps,
                                    ArcOverlaps,
                                    maxgap = 10000)
crossRegionOverlaps
## Hits object with 5 hits and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         6           5
##   [2]        15          12
##   [3]        19          16
##   [4]        22          21
##   [5]        24          23
##   -------
##   queryLength: 24 / subjectLength: 23
chromHighFst <- as.character(
   unique(seqnames(AlpOverlaps[queryHits(crossRegionOverlaps)]))
)
fst1$lake <- 'Thun-Brienz'
fst1$height <- 3
fst2$lake <- 'Walen-Zurich'
fst2$height <- 4
fst3$lake <- 'Suopatjavri'
fst3$height <- 1
fst4$lake <- 'Langfjordvatn'
fst4$height <- 2
AllTables <- rbind(fst1[f1,], fst2[f2,], fst3[f3,], fst4[f4,])
for (chr in chromHighFst) {
   f <- AllTables$CHROM == chr
   p <- ggplot(data = AllTables[f,],
               mapping = aes(x = BIN_START,
                             y = lake,
                             xend = BIN_END,
                             yend = lake,
                             color = WEIGHTED_FST)) +
      geom_segment(size = 5) + xlab('Position (bp)') +
      ggtitle(chr) +
      guides(colour=guide_legend(title=expression(F[ST])))
   print(p)
}

There are 5 regions, in 5 chromosomes with approximate overlap among the 4 lakes. These are regions of interest that should be studied further to see what genes are there.

Independent windows

For a more formal test of coincidence of highly differentiated regions between two lakes, we need \(F_{ST}\) estimates independent of each other, that is, from non-overlaping windows.

if [ ! -e FST/Thun-Brienz.ind.windowed.weir.fst ]; then
   vcftools --vcf dirty.recode.vcf \
            --out FST/Thun-Brienz.ind \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop C.fatioi.txt \
            --weir-fst-pop C.steinmanni.txt \
            --fst-window-size 1000000 \
            --fst-window-step 1000000 >> vcftools.log 2>&1
fi
if [ ! -e FST/Walen-Zurich.ind.windowed.weir.fst ]; then
   vcftools --vcf dirty.recode.vcf \
            --out FST/Walen-Zurich.ind \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop C.zuerichensis.txt \
            --weir-fst-pop C.duplex.txt \
            --fst-window-size 1000000 \
            --fst-window-step 1000000 >> vcftools.log 2>&1
fi

if [ ! -e FST/Suo.ind.windowed.weir.fst ]; then
   vcftools --vcf dirty.recode.vcf \
            --out FST/Suo.ind \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop ../2022-12-23/SUO_P.txt \
            --weir-fst-pop ../2022-12-23/SUO_L.txt \
            --fst-window-size 1000000 \
            --fst-window-step 1000000 >> vcftools.log 2>&1
fi

if [ ! -e FST/Lan.ind.windowed.weir.fst ]; then
   vcftools --vcf dirty.recode.vcf \
            --out FST/Lan.ind \
            --thin 500 \
            --remove-indels \
            --mac 2 \
            --max-alleles 2 \
            --max-missing 1 \
            --max-meanDP 1000 \
            --weir-fst-pop ../2022-12-23/LAN_P.txt \
            --weir-fst-pop ../2022-12-23/LAN_L.txt \
            --fst-window-size 1000000 \
            --fst-window-step 1000000 >> vcftools.log 2>&1
fi
fst1 <- read.table('FST/Thun-Brienz.ind.windowed.weir.fst',
                   header = TRUE, stringsAsFactors = FALSE)
fst2 <- read.table('FST/Walen-Zurich.ind.windowed.weir.fst',
                   header = TRUE, stringsAsFactors = FALSE)
fst3 <- read.table('FST/Suo.ind.windowed.weir.fst',
                   header = TRUE, stringsAsFactors = FALSE)
fst4 <- read.table('FST/Lan.ind.windowed.weir.fst',
                   header = TRUE, stringsAsFactors = FALSE)

Bivariate distribution of \(F_{ST}\)s

The plots below give an idea of how few genomic regions are highly differentiated between ecomorphs simultaneously in two lake systems. The comparisons between lakes from different regions must be quite pointless, because we expecte even less coincidence.

row.names(fst1) <- paste0(fst1$CHROM, ':', fst1$BIN_START, sep = '')
row.names(fst2) <- paste0(fst2$CHROM, ':', fst2$BIN_START, sep = '')
row.names(fst3) <- paste0(fst3$CHROM, ':', fst3$BIN_START, sep = '')
row.names(fst4) <- paste0(fst4$CHROM, ':', fst4$BIN_START, sep = '')
f <- intersect(intersect(row.names(fst1), row.names(fst2)),
               intersect(row.names(fst3), row.names(fst4)))
commonFst <- data.frame(
   chr    = fst1[f, 'CHROM'],
   start  = fst1[f, 'BIN_START'],
   end    = fst1[f, 'BIN_END'],
   Thu    = fst1[f, 'WEIGHTED_FST'],
   Wal    = fst2[f, 'WEIGHTED_FST'],
   Suo    = fst3[f, 'WEIGHTED_FST'],
   Lan    = fst4[f, 'WEIGHTED_FST']
)

ggplot(commonFst, aes(x=Thu, y=Wal)) + geom_hex() +
   xlab('Lakes Thun and Brienz') +
   ylab('Lakes Walen and Zurich') +
   ggtitle('Alpine region')

ggplot(commonFst, aes(x=Suo, y=Lan)) + geom_hex() +
   xlab('Suopatjavri') + ylab('Langfjordvatn') +
   ggtitle('Arctic region')

Kolmogorov-Smirnof

thrs1 <- quantile(commonFst$Thu, probs = 0.9)
thrs2 <- quantile(commonFst$Wal, probs = 0.9)
thrs3 <- quantile(commonFst$Suo, probs = 0.9)
thrs4 <- quantile(commonFst$Lan, probs = 0.9)

ks.results <- data.frame(
   chr = unique(commonFst$chr),
   WalThu = NA,
   ThuWal = NA,
   LanSuo = NA,
   SuoLan = NA
)

row.names(ks.results) <- ks.results$chr

for (chr in unique(commonFst$chr)) {
   HighFstThu <- commonFst[commonFst$Thu >= thrs1 & commonFst$chr == chr,]
   HighFstWal <- commonFst[commonFst$Wal >= thrs2 & commonFst$chr == chr,]
   HighFstSuo <- commonFst[commonFst$Suo >= thrs3 & commonFst$chr == chr,]
   HighFstLan <- commonFst[commonFst$Lan >= thrs4 & commonFst$chr == chr,]
   if (dim(HighFstThu)[1] >= 5) {
      ks.results[chr, 'ThuWal'] <- ks.test(HighFstThu$Wal,
                                           commonFst$Wal, exact = FALSE,
                                           alternative = 'greater')$p.value
   }
   if (dim(HighFstWal)[1] >= 5) {
      ks.results[chr, 'WalThu'] <- ks.test(HighFstWal$Thu,
                                           commonFst$Thu, exact = FALSE,
                                           alternative = 'greater')$p.value
   }
   if (dim(HighFstLan)[1] >= 5) {
      ks.results[chr, 'LanSuo'] <- ks.test(HighFstLan$Suo,
                                           commonFst$Suo, exact = FALSE,
                                           alternative = 'greater')$p.value
   }
   if (dim(HighFstSuo)[1] >= 5) {
      ks.results[chr, 'SuoLan'] <- ks.test(HighFstSuo$Lan,
                                           commonFst$Lan, exact = FALSE,
                                           alternative = 'greater')$p.value
   }
}
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties

## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties

## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstWal$Thu, commonFst$Thu, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties

## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties

## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstThu$Wal, commonFst$Wal, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstLan$Suo, commonFst$Suo, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties
## Warning in ks.test(HighFstSuo$Lan, commonFst$Lan, exact = FALSE, alternative =
## "greater"): p-value will be approximate in the presence of ties

Session Information

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
## [1] ggplot2_3.3.6        GenomicRanges_1.38.0 GenomeInfoDb_1.22.1 
## [4] Biostrings_2.54.0    XVector_0.26.0       IRanges_2.20.2      
## [7] S4Vectors_0.24.4     BiocGenerics_0.32.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.2       xfun_0.31              bslib_0.3.1           
##  [4] purrr_0.3.4            lattice_0.20-45        colorspace_2.0-3      
##  [7] vctrs_0.5.0            generics_0.1.3         htmltools_0.5.2       
## [10] yaml_2.3.5             utf8_1.2.2             rlang_1.0.6           
## [13] hexbin_1.28.2          jquerylib_0.1.4        pillar_1.8.1          
## [16] glue_1.6.2             withr_2.5.0            DBI_1.1.2             
## [19] GenomeInfoDbData_1.2.2 lifecycle_1.0.3        stringr_1.4.0         
## [22] zlibbioc_1.32.0        munsell_0.5.0          gtable_0.3.1          
## [25] evaluate_0.15          labeling_0.4.2         knitr_1.39            
## [28] fastmap_1.1.0          fansi_1.0.3            highr_0.9             
## [31] scales_1.2.1           jsonlite_1.8.0         farver_2.1.1          
## [34] digest_0.6.30          stringi_1.7.8          dplyr_1.0.9           
## [37] grid_3.6.3             cli_3.4.1              tools_3.6.3           
## [40] bitops_1.0-7           magrittr_2.0.3         sass_0.4.1            
## [43] RCurl_1.98-1.6         tibble_3.1.8           pkgconfig_2.0.3       
## [46] assertthat_0.2.1       rmarkdown_2.14         rstudioapi_0.13       
## [49] R6_2.5.1               compiler_3.6.3

References

Cruickshank, Tami E., and Matthew W. Hahn. 2014. “Reanalysis Suggests That Genomic Islands of Speciation Are Due to Reduced Diversity, Not Reduced Gene Flow.” Molecular Ecology 23 (13): 3133–57. https://doi.org/10.1111/mec.12796.

Danecek, Petr, Adam Auton, Goncalo Abecasis, Cornelis A. Albers, Eric Banks, Mark A. DePristo, Robert E. Handsaker, et al. 2011. “The variant call format and VCFtools.” Bioinformatics 27 (15): 2156–8. https://doi.org/10.1093/bioinformatics/btr330.

Weir, B. S., and C. Clark Cockerham. 1984. “Estimating F-Statistics for the Analysis of Population Structure.” Evolution 38 (6): 1358–70.